Below are examples illustrating the theorems proved in Kleinberg's paper, "An Impossibility Theorem for Clustering" (https://www.cs.cornell.edu/home/kleinber/nips15.pdf)

Note:

The third example is incorrect as it currently stands.

Generate data to cluster


In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import make_blobs

# 100 points in 2-D drawn from two Gaussian blobs.
data, labels = make_blobs(n_samples=100, n_features=2, centers=2,
                          cluster_std=4, random_state=2)

plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='coolwarm');



In [2]:
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import make_blobs

data, labels = make_blobs(n_samples=10, n_features=2, centers=2,cluster_std=3,random_state=5)

plt.scatter(data[:,0], data[:,1], c = labels, cmap='coolwarm');


1. K-Cluster Stopping Condition - Fails Richness Condition

k-cluster stopping condition: Stop adding edges when the subgraph first consists of k connected components.

Richness condition: Every partition of S is a possible output. To state this more compactly, let Range(f) denote the set of all partitions Γ such that f(d) = Γ for some distance function d; richness requires that Range(f) equal the set of all partitions of S. In other words, given only the names of the points (the indices in S) but not the distances between them, for any desired partition Γ it should be possible to construct a distance function d on S for which f(d) = Γ.
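To make the stopping condition concrete, here is a minimal sketch (not part of the original notebook) of Kleinberg's single-linkage procedure with the k-cluster stopping condition: add edges in increasing order of weight and stop as soon as the graph first consists of k connected components.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def single_linkage_k(dist, k):
    """Single linkage with the k-cluster stopping condition: add edges in
    increasing order of weight and stop as soon as the graph first
    consists of k connected components."""
    n = dist.shape[0]
    parent = list(range(n))                      # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n))

    components = n
    for _, i, j in edges:
        if components == k:                      # the stopping condition
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    roots = [find(i) for i in range(n)]
    return np.unique(roots, return_inverse=True)[1]

# Whatever distance matrix we supply, the output has exactly 2 blocks.
print(single_linkage_k(euclidean_distances(data, data), k=2))

However the distance matrix is chosen, the loop exits with exactly k components, so Range(f) contains only partitions into exactly k blocks; this is why richness fails.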


In [3]:
from sklearn.cluster import KMeans

# k is fixed at 1, so every point lands in the same (single) cluster.
kmeans = KMeans(n_clusters=1)
kmeans.fit(data)

plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='coolwarm');


With k fixed in advance, no distance function can lead this algorithm to any other partition: here k = 1, so the only available output is the single all-points cluster.

Takeaway:

The model cannot produce different cluster partitions under different distance functions; its output always has exactly k clusters. It thereby fails the richness condition.

In general, models with a k-cluster stopping condition are limited in the same way: no choice of distance function can make them produce a partition with anything other than k clusters, so they cannot reach all possible partitions.
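By contrast, the construction that richness asks for is straightforward for an algorithm that does not fix k in advance. The sketch below (an added illustration; the target partition is arbitrary) builds a distance function that is small within the blocks of a desired partition and large across them, and DBSCAN with min_samples=1 on the precomputed distances recovers that partition exactly (up to label numbering).

import numpy as np
from sklearn.cluster import DBSCAN

# An arbitrary target partition Γ of 10 points (any labeling would do).
target = np.array([0, 0, 1, 2, 1, 0, 2, 2, 1, 0])

# Construct d: distance 1 within a block of Γ, distance 10 across blocks.
d = np.where(target[:, None] == target[None, :], 1.0, 10.0)
np.fill_diagonal(d, 0.0)

# Any eps between 1 and 10 works; min_samples=1 so no point is noise.
db = DBSCAN(eps=4, min_samples=1, metric='precomputed')
print(db.fit_predict(d))   # recovers the target partition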

2. Distance-r Stopping Condition - Fails Scale-Invariance Condition

Distance-r stopping condition: Only add edges of weight at most r.

Scale-Invariance: For any distance function d and any α > 0, we have f(d) = f(α · d).
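The failure can be seen even without DBSCAN. Below is a minimal sketch (not part of the original notebook) of the distance-r rule itself, applied to three made-up points on a line: with r = 4 the two nearest points are joined, but after multiplying every distance by α = 1.1 the same rule returns three singletons, so f(d) ≠ f(α·d).

import numpy as np
from scipy.sparse.csgraph import connected_components

def distance_r_clusters(dist, r):
    """Distance-r stopping condition: keep only edges of weight at most r,
    then read off the connected components of the resulting graph."""
    adjacency = (dist <= r).astype(int)
    return connected_components(adjacency, directed=False)[1]

points = np.array([0.0, 3.9, 8.0])               # three points on a line
dist = np.abs(points[:, None] - points[None, :])

print(distance_r_clusters(dist, r=4))            # [0 0 1]: nearest pair joined
print(distance_r_clusters(dist * 1.1, r=4))      # [0 1 2]: all singletons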

DBSCAN Example

Here eps = 4 plays the role of the maximum edge weight r.


In [4]:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=4, metric='precomputed')

# Pairwise Euclidean distance matrix, plus a copy scaled by alpha = 1.1.
distance = euclidean_distances(data, data)
distance_times_alpha = distance * 1.1

Euclidean Distance * 1


In [5]:
db.fit(distance)

euc = db.labels_

plt.scatter(data[:, 0], data[:, 1], c=euc, cmap='coolwarm');


Euclidean Distance * 1.1


In [6]:
db.fit(distance_times_alpha)

euc_alpha = db.labels_

plt.scatter(data[:, 0], data[:, 1], c=euc_alpha, cmap='coolwarm');



In [7]:
# A -1 or 1 entry marks a point the two runs classified differently;
# here the point at index 1 changed clusters.
euc - euc_alpha


Out[7]:
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
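As an aside, subtracting the label arrays only reveals the change because both runs happen to number their clusters the same way; a permutation-invariant comparison such as scikit-learn's adjusted_rand_score is a more robust check.

from sklearn.metrics import adjusted_rand_score

# 1.0 would mean the two clusterings agree up to relabeling; anything
# lower confirms the scaled run produced a different partition.
adjusted_rand_score(euc, euc_alpha)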

Takeaway:

The model produced different clusters after every pairwise distance was multiplied by the same constant α = 1.1. It thereby does not have the scale-invariance property.

In general, models with a distance-r stopping condition are similarly sensitive to rescaled data: the fixed threshold r does not move with the distances.

3. Scale-α Stopping Condition - Fails Consistency Condition

Scale-α stopping condition: Let ρ* denote the maximum pairwise distance, i.e. ρ* = max_{i,j} d(i, j). Only add edges of weight at most α·ρ*.

Consistency condition: Let d and d′ be two distance functions. If f(d) = Γ, and d′ is a Γ-transformation of d (i.e. d′ shrinks or preserves distances within the clusters of Γ and grows or preserves distances between them), then f(d′) = Γ.
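Swapping Euclidean for Manhattan distances, as the example below does, is not a Γ-transformation, which is presumably why the note at the top flags this example as incorrect. For completeness, here is a minimal sketch (not part of the original notebook, on a made-up four-point distance matrix) of the scale-α rule under a genuine Γ-transformation: growing a single between-cluster distance raises ρ*, and with it the threshold α·ρ*, until the remaining between-cluster edges fall below it and the output changes.

import numpy as np
from scipy.sparse.csgraph import connected_components

def scale_alpha_clusters(dist, alpha):
    """Scale-α stopping condition: keep only edges of weight at most
    α·ρ*, where ρ* is the maximum pairwise distance, then label the
    connected components of the resulting graph."""
    adjacency = (dist <= alpha * dist.max()).astype(int)
    return connected_components(adjacency, directed=False)[1]

# Four points a, b, c, d with intended clusters {a, b} and {c, d}.
#              a    b    c    d
d = np.array([[0.0, 1.0, 3.0, 3.0],
              [1.0, 0.0, 3.0, 3.0],
              [3.0, 3.0, 0.0, 1.0],
              [3.0, 3.0, 1.0, 0.0]])

print(scale_alpha_clusters(d, alpha=0.5))   # [0 0 1 1]: Γ = {{a,b},{c,d}}

# Γ-transformation: grow one between-cluster distance, d(b, d), to 100.
# Within-cluster distances are unchanged and no between-cluster distance
# shrinks, so this is a legal Γ-transformation of d.
d2 = d.copy()
d2[1, 3] = d2[3, 1] = 100.0

# ρ* jumps from 3 to 100, the threshold from 1.5 to 50, and the other
# between-cluster edges (still 3.0) now fall below it.
print(scale_alpha_clusters(d2, alpha=0.5))  # [0 0 0 0]: consistency fails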

Agglomerative Clustering Example


In [8]:
from sklearn.cluster import AgglomerativeClustering

In [9]:
# Complete-linkage agglomerative clustering with Euclidean distances.
# (In newer scikit-learn versions the `affinity` parameter is named `metric`.)
agg_model_euclidean = AgglomerativeClustering(n_clusters=2, affinity='euclidean',
                                              linkage='complete')

agg_model_euclidean.fit(data)

plt.scatter(data[:, 0], data[:, 1], c=agg_model_euclidean.labels_, cmap='coolwarm');



In [10]:
agg_model_manhattan = AgglomerativeClustering(n_clusters=2,affinity='manhattan',linkage='complete')

agg_model_manhattan.fit(data)

plt.scatter(data[:,0], data[:,1], c = agg_model_manhattan.labels_, cmap='coolwarm');



In [11]:
# A -1 or 1 entry marks a point the two models classified differently;
# here the points at indices 0 and 3 changed clusters.
agg_model_euclidean.labels_ - agg_model_manhattan.labels_


Out[11]:
array([-1,  0,  0, -1,  0,  0,  0,  0,  0,  0], dtype=int64)

Takeaway:

The model produced different clusters under the two distance metrics. As the note at the top says, though, this does not actually test consistency, since changing the metric is not a Γ-transformation of the original distance function; the sketch above shows the genuine failure.

In general, models with a scale-α stopping condition fail consistency: a Γ-transformation can change the maximum distance ρ*, and with it the threshold α·ρ*, enough to change the output.